Cross-dataset rules

Warning Best practice for cross-table joins is to define a view and scan the view.
Only in circumstances when a view cannot be created should you define cross-table joins with 2 separate datasets (DQ jobs) and express the join in the rule.

If you're doing multiple lookups, this will improve long-term performance and consolidate maintenance.

Note Cross-dataset rules require -addlib.

When a cross-data set rule uses two connections, be sure the jars are in the -lib or -addlib directory.

In-clause (single column)

Copy
select * from @table1 where id 
not in ( select id from @table2 )

Except (multi-column)

Copy
select id, app_id, email, guid_num from @table1
EXCEPT
select id, app_id, email, guid_num from @table2

Referencing secondary datasets

When you create a Freeform SQL rule with secondary datasets, there are three important points to consider in order to avoid missing data set exceptions.

  • Ensure that the name of your first data set matches the name of the main data set on which your rule is created.
  • The column names in your table must be exact matches with the syntax of your query. >> If the column name uses CAPS, then your query must also use CAPS.
  • Ensure that any data set referenced in your rule maintains a score of greater than or equal to (>=) a passing score of 75.

Note These three points apply to all rules, not just Freeform Rules.

Note When you run a Job that references a secondary dataset, a temporary Job to load the secondary dataset appears on the Jobs page. This Job temporarily changes the dataset name while the Spark process runs. When you click the temporary Job after its status updates to Finished, the Findings page may show Rows: undefined in the Profile details. If this happens, you can refresh the Jobs page until the Job reverts to its actual name, and then click the Job to view the results.

Joins

A join lets you reference two or more datasets in one rule. Joins are useful when you want to compare the values of a data set from a previous, or when you want to verify that all values are valid across your datasets.

Join statements

Available join types between multiple datasets

  • WHERE tableA.id = tableB.id style
  • INNER JOIN
  • LEFT <OUTER> JOIN
  • RIGHT <OUTER> JOIN

Joining other datasets

  • Get the historical state of the same dataset.
    Syntax:@t<n>, where the parameter n means, the historical run (number) of the base dataset (marked with @<dataset_name>) to instruct the temporary job created at runtime to target. The following table demonstrates what each of the following @tn value examples mean:
    @tn valueDescription
    @t1Loads the data from the previous scan.
    @t2Loads the data from the second to last scan.
    @t3Loads the data from the third to last scan.
    @t4Loads the data from the fourth to last scan
  • Get a different dataset
    Syntax: @<other_dataset_name>

Note When you use a rule with @t1 syntax, a temporary job in the Jobs queue appears. This temporary job displays a LOAD status to allow the run that the @t1 is targeting to load and use it for the rule.

Join example

Copy
SELECT
    *
FROM
    @table1 A
    LEFT JOIN @table2 B ON A.id = B.id
where
    B.id is null OR (A.email != B.email)

Limitations

  • There is a known limitation where a Cartesian join which produces a greater number of rows than the initial dataset shows a score greater than 100%. There is not a workaround other than the query for the DQ job be the query that produces that larger number of rows.
  • @t1 rules where Cartesian joins are created are not supported.
  • Total records are calculated by adding the total number of breaking and passing records. Therefore, when referencing a secondary dataset in a native rule, the extra rows from the secondary dataset are included in the query results on the Findings page, skewing the total rows calculation and the percentage of breaking records.